Common genetic variation associated with Mendelian disease severity revealed through cryptic phenotype analysis

Clinical heterogeneity is common in Mendelian disease, but small sample sizes make it difficult to identify specific contributing factors. However, if a disease represents the severely affected extreme of a spectrum of phenotypic variation, then modifier effects may be apparent within a larger subset of the population. Analyses that take advantage of this full spectrum could have substantially increased power. To test this, we developed cryptic phenotype analysis, a model-based approach that infers quantitative traits that capture disease-related phenotypic variability using qualitative symptom data. By applying this approach to 50 Mendelian diseases in two cohorts, we identify traits that reliably quantify disease severity. We then conduct genome-wide association analyses for five of the inferred cryptic phenotypes, uncovering common variation that is predictive of Mendelian disease-related diagnoses and outcomes. Overall, this study highlights the utility of computationally-derived phenotypes and biobank-scale cohorts for investigating the complex genetic architecture of Mendelian diseases.


UK Biobank
The bulk UK Biobank (UKBB) dataset was downloaded on Jan 22nd, 2020 using the software provided by the organization 5 . Following download, the raw data file was parsed, isolating demographic variables of diseases. By integrating this manually curated set with the diseases annotated by the Human Phenotype Ontology 8 (HPO), we generated a raw list of 166 rare disease phenotypes aligned to both the ICD10-CM and HPO terminologies.
From this set of 166 rare diseases, we further filtered them according to: 1. Their prevalence within the UCSF-CDW 2. Their number of annotated HPO terms More specifically, we included all diseases for downstream analyses that had a population prevalence of at least 5 × 10 −5 . In addition, we limited our analyses to those diseases with at least 5 annotated HPO terms, as our model for latent phenotype inference showed instability for diseases annotated with less HPO terms (see below). The final set of included diseases (along with modes of inheritance, prevalence and assigned ICD10/ICD10-CM codes) is provided as SupplementaryDataFile 1.txt.

Alignment of the Human Phenotype Ontology to the ICD10-CM Terminology
We aligned the Human Phenotype Ontology 9 (HPO) [20] to the ICD10-CM terminology [7] in an automated fashion using multiple sources of data. First, we downloaded the HPO-to-ICD10 and HPO-to-ICD9 mappings (obtained through lexical matching using the LOOM tool [21]) from the National Center for Biomedical Ontology's BioPortal 10 [22]. We added to these strict lexical matches by integrating the mappings obtained from two other ontologies: the UMLS Metathesaurus 11 [23] and the SNOMED-CT-to-ICD10 mappings obtained from the US Edition of SNOMED 12 [24]. To further expand the final set of HPO-to-ICD10-CM matches, we also downloaded an ICD9-to-ICD10 map from the National Bureau for Economics Research 13 , enabling us to generate HPO-to-ICD10-CM mappings transitively through the ICD9 encoding. Finally, similar to [25], we used a partial logical expansion strategy to increase the coverage of our HPO-to-ICD10-CM mappings. More specifically, for cases in which an HPO term had no ICD10 annotations, we aligned it with the ICD10 annotations of its parent as long as the following criteria were met: 1. A child term has only a single parent term 2. The parent term only has a single child term Overall, the integration of these different datasets plus post-processing enabled the successful alignment of 3, 166 HPO terms (total number of terms in the HPO version downloaded on 8/8/2019: 14, 707) and yielded 4, 922 HPO-to-ICD10 alignments. We further the filtered this list of aligned HPO terms by: 1. Removing terms that were aligned to the ICD10-CM code for a rare disease of interest 2. Removing terms aligned to a disallowed ICD-10-CM code 3. Collapsing two or more HPO terms that aligned to the same ICD-10-CM code (or a strict subset codes) This reduced the number of unique HPO terms (or combinations of terms) to 1, 674. The final HPO-to-ICD10-CM alignments used in this study are provided in SupplementaryDataFile 2.txt. We used the phenotype annotations available through the HPO website 14 to assign symptoms (in the form of HPO terms) to the rare diseases of interest (see Supplementary Figure 1 to see a global overview of the symptom-to-disease mappings). Although we did not manually review the assignments, we did evaluate their performance on a simple rare disease diagnosis prediction task. More specifically, ElasticNet classifiers (implemented in sklearn [26]) for rare disease diagnoses were constructed in the UCSF training datasets using three sets of features: annotated HPO terms, all HPO terms, and the full ICD10 codebook. The 8 https://hpo.jax.org/app/download/ontology 9 https://hpo.jax.org/app/download/ontology 10 https://bioportal.bioontology.org/ 11 https://www.nlm.nih.gov/research/umls/knowledge_sources/metathesaurus/index.html 12 File: tls Icd10cmHumanReadableMap US1000124 20190301.tsv; https://www.nlm.nih.gov/healthit/snomedct/us_ edition.html 13 https://www.nber.org/data/icd9-icd-10-cm-and-pcs-crosswalk-general-equivalence-mapping.html 14 https://hpo.jax.org/app/download/annotation classifiers built using the annotated HPO terms performed substantially better than random (assessed using the average precision score [27]) for nearly all of the included disorders (Supplementary Figure 2a). Moreover, although the classifiers constructed using the full ICD10/HPO codebooks performed systematically better on average ( Supplementary Figure 2b), the models constructed using the annotated HPO terms were statistically indistinguishable (based on overlapping 95% confidence intervals) from those constructed using the complete ICD10 codebook (which includes >10,000 features) for 39 of 50 disorders (78%, see Supplementary Table 1).  Figure 1: A Sankey diagram depicting the alignments between 166 rare, Mendelian diseases (left side) and their HPO-term encoded symptoms (right side). On the left, the Mendelian diseases were grouped according to their parent ICD10-CM chapters such that each colored bar indicates a distinct grouping of disorders, and the height of the bar indicates the total number of diseases in each group. Similarly, on the right side, the HPO symptoms were grouped according to their most general ancestor in the ontology. Annotations between the groups of diseases and symptoms are represented as curved lines of various thickness, with the width of each line proportional to the total number of annotations linking the two groups.

Cryptic Phenotype Analysis
Our approach to identifying modifiers of Mendelian disease severity relies on the detection of a quantitative but latent (or cryptic) spectrum of variation that drives the observed symptoms of the disorder (i.e. its morbidity or expressivity). However, there are no guarantees that such a spectrum exists. Therefore, we utilized a two-stage process to identify Mendelian diseases that may benefit from this spectrum-based approach. First, we performed statistical inference to estimate the latent phenotypes underlying a set of observed symptoms (see Supplementary Figure 5, left side) in two distinct datasets (UCSF and UKBB). Then, we performed extensive evaluations of the inferred phenotypes to ensure that they: 1) captured the severity of the Mendelian disease of interest, and 2) were consistent across datasets. Cryptic Phenotype Analysis (CPA) refers to the complete two-stage process.

A Generative Model for Disease Symptoms
Consider the set of K symptoms that are associated with some Mendelian disease of interest. We assume that all of symptoms are binary (present/absent) and permanent (i.e. once diagnosed, they do not resolve). Let S i,j denote the status of the jth symptom in the ith subject such that S i,j = 1 indicates that the patient Performance Across Terminologies a b Supplementary Figure 2: Classifiers to predict rare disease diagnoses in the UCSF dataset were constructed from three different terminologies: annotated HPO terms, all HPO terms, and all ICD10-CM codes. (a): The average precision score (evaluated in the testing subset) for the three different classifiers is compared across all 50 rare diseases included in the study. The average precision score is normalized by the prevalence of each diagnosis, which approximates the average precision score of a random classifier [28]. For reference, the prevalence of each diagnosis is plotted to the right (red line denotes the mean prevalence across all diseases). All classifiers were constructed using the LogisticRegression function implemented in sklearn [26], invoking the elasticnet penalty and the following hyper-parameter settings: max iter=500, tol=0.001, solver='saga', C=1.0, and l1 ratio=0.5. (b): Boxplots comparing the average precision score distributions across the 3 different terminologies for the 50 diseases included in this study. The boxes themselves represents the interquartile range for the performance statistics, and the central horizontal lines represent the medians. The notches in the boxplots provide a bootstrapped estimate for the 95% confidence interval of the median (1,000 re-samples). The whiskers indicate the range for the observed values.
Supplementary has received this diagnosis. Furthermore, let S denote an N × K-dimensional matrix of symptom diagnoses such that the ith row of the matrix (denoted S i ) contains the diagnoses for subject i, and let Z denote an N × L matrix of latent phenotype values, where each column of the matrix represents the magnitude (i.e. severity) of a distinct latent phenotype. We assume that the binary symptom matrix is generated stochastically from the latent phenotypes according to: where f (Z; θ) is a function (defined by the parameters θ) that maps the latent phenotypes onto a matrix of symptom probabilities (i.e. f (Z; θ) ∈ [0, 1] N ×K ). We construct a joint model for S and Z by specifying a generative probability distribution for the latent phenotypes themselves: where ϕ is the set of parameters defining the prior distribution for the latent phenotypes.
To fully define the joint model, we must precisely specify the symptom risk function (f (Z; θ)) and the prior distribution for Z (P (Z|ϕ)). Starting with the latter, we desire a phenotype model that produces continuous random variables, is computationally tractable, and assumes that the L distinct latent phenotypes are independent of one another. This last assumption is important because it aids in identifiability. Our goal is to isolate latent phenotypes that best explain the symptom variability observed within each Mendelian disease. If our model happens to decompose a set of symptoms into multiple phenotypes, then we want them to be independent of one another in order maximize our chances of isolating a single latent component capable of explaining this variability. Therefore, in the current study, we used the isotropic standard gaussian as the prior distribution for the latent phenotypes: where I is the L×L-dimensional identity matrix. Note, there is no advantage to relaxing the constant variance assumption implicit in this construction; the symptom risk function can always systematically re-scale the latent phenotype values to allow for non-constant variance across the different dimensions.
With respect to the function f (Z; θ), we modeled symptom risks using the following generalized linear model: where B denotes a K-dimensional vector of symptom-specific offset terms and W represents an L × Kdimensional matrix of latent phenotype component weights (i.e. θ = {B, W}). To simplify the interpretation of the inferred latent phenotypes, we restricted the values of W to be non-negative such that w l,k ≥ 0, ∀l, k. This restriction was in invoked for two reasons. First, it ensured that all of the symptoms annotated to a Mendelian disease were positively correlated, which should be generally be true by definition. Second, it aided in the interpretation of the inferred latent phenotypes, as it ensured that individuals with large, positive latent phenotype values had the highest risk of being diagnosed with multiple, rare disease symptoms (i.e. the greatest morbidity). Note, more complicated symptom risk functions are possible (ex: multi-layered non-linear neural network), but in our preliminary analyses, such functions were associated with substantially increased complexity while adding little in terms of expressivity. Further research is required to determine if and/or when more complicated risk functions are necessary to capture the correlation structure among disease symptoms.

A Variational Approach to Model Inference
Ultimately, we are interested in producing estimates for the latent phenotype matrix (denotedẐ) given some observed symptom matrix (denoted S = s). This idea of summarizing an observed dataset with a lower-dimensional matrix of latent variables is also known as matrix factorization (or factor analysis), and there are multiple examples of its successful application to a diverse set of problems in computation biology and genetics [29,30,31,32]. Although many general-purpose matrix factorization algorithms have been developed, most were designed for continuously-distributed observed data and/or do not impose independence assumptions among the latent variables. Therefore, we developed an approximate, variational Bayesian inference algorithm that is specific to the model outlined in Equation 1 using an approach that has become popular within the machine learning community due to its scalability and overall performance [33,34]. More specifically, our goal was to infer a posterior distribution over Z given the observed symptom matrix S = s and the parameters θ (i.e. P (Z|s, θ)). To do so, we first defined the likelihood of the observed data for our model: where f (Z; θ) i,j denotes element-(i, j) of the matrix produced by the symptom risk function. The joint likelihood for the data and the latent phenotypes is: and to specify the desired posterior distribution, we need to compute the following marginal likelihood (also called the model evidence): Because this integral is analytically intractable, we produced a lower bound approximation to the model evidence (denoted ELBO) by invoking Jensen's Inequality: where q(Z|s, ψ) denotes an approximate posterior distribution over Z defined by the parameter set ψ, E G F denotes the expectation of F with respect to G, and KL F ||G denotes the Kullback-Leibler (KL)-divergence from G to F . By optimizing this lower bound with respect θ and ψ, we obtain an estimate of the model marginal likelihood (i.e. the evidence) while simultaneously learning an approximate posterior distribution over the latent phenotypes. Further details regarding the general variational approximation and its application to statistical models can be found in [35].
In order for the lower bound defined in Equation 4 to be practically useful, we need specify the approximate posterior distribution q(Z|s, ψ). One approach to defining this function is to invoke the mean field approximation, which restricts the joint posterior distribution over the latent variables to be conditionally independent given a set of variational parameters (i.e. q(Z|s, ψ) = i q(Z i |s i , ψ i )). Although the optimal solution for q(Z|s, ψ) under the mean field approximation can sometimes be derived analytically [35], this is not true for the model defined in Equation 1. Therefore, we used the approach outlined in [33,34] and directly set q(Z i |s, ψ) to have a specific functional form: a multi-layered non-linear neural network 15 .
There are several advantages to this approach. First, instead of optimizing a unique set of variational parameters for each observed datapoint (as is the case for the mean field approximation), the posterior distribution for every datapoint arises from a deterministic function that relies on the same set of parameters. This is known as amortized inference [34], and it enables variational inference algorithms to scale to very large datasets. Second, optimization problems involving non-linear neural networks have become common in the machine learning community [36], and multiple, highly-flexible software libraries now exist that enable the rapid development of scalable optimization algorithms for non-linear neural networks [37] and complex probability models [38].

Model Inference in Practice
As discussed above, model inference using the variational approach proceeds by finding the parameter values (denotedθ andψ) that maximize the lower bound specified in Equation 4. When a neural network is chosen as the approximate posterior distribution, this objective function is generally maximized using some version of stochastic gradient descent, an iterative algorithm that scales well to large datasets and complex models [36]. Although theoretically guaranteed to converge to a local optimum under certain conditions, the performance of stochastic gradient descent can vary widely across datasets, models, and implementations [36].
To mitigate some of this variability, we aimed to design an inference algorithm that was flexible and enabled excellent any-time performance, as we wanted to rapidly deploy our statistical model to many different sets of disease symptoms without having to individually tune the algorithm's hyper-parameters. In our implementation, we utilized an adaptive gradient descent method that includes a penalty term to avoid overfitting 16 (AdamW, see [39]). With respect to the learning rate of the algorithm, we initially started with a very small value, which was increased towards a large, maximum learning rate prior to decreasing the rate down to a small value once again (one-cycle learning rate scheduler). This particular approach has been shown to enable fast and accurate training of complex neural networks [40], and we found this to be true for our model as well.
In some cases, however, we found that different runs of the optimization algorithm could yield substantially different results, particularly with respect to the symptom risk function parameters (θ). This observation appeared to be driven by convergence to local modes in the objective function, likely exacerbated by the stochastic nature of the inference algorithm. To avoid local modes that tended to result in simpler but sub-optimal models, we utilized the annealing approach described in [41] to slowly incorporate the KL-divergence term into the lower bound specified in Equation 4. We found that this annealing approach substantially improved the consistency of the inferred models across different initializations (see below). Finally, by default, we monitored for algorithm convergence by computing the (untempered) ELBO on a held-out sample of data (termed the validation dataset). Unless otherwise noted, we withheld 20% of the training dataset for validation. Note, this validation dataset is distinct from the testing dataset, which was only used in the final stage of our cryptic phenotype analysis (see below). The inference algorithm outlined above was implemented using the Pyro Probabilistic Programming Language [38] and PyTorch [37]. Additional details regarding the algorithm, including default hyper-parameters (which were used unless otherwise noted), can here can be found on Github: https://github.com/daverblair/vlpi.

Assessing Model Convergence
The inference algorithm described above was designed in order maximize inferential consistency without having to tune a unique set of hyper-parameters for each symptom dataset. That said, inconsistent inference results were still possible, especially given the stochastic nature of the algorithm coupled with the multimodality of the objective function. Therefore, we wanted to ensure that the model inference results were consistent and robust across multiple model fitting attempts.
To do so, we fit the latent phenotype model to the symptom sets corresponding to each rare disease using 20 different trials of our stochastic inference algorithm. Each trial used the same set of algorithmic hyperparameters (maximum learning rate: 0.02, mini-batch size: 5000, max. epochs: 2000, convergence error tol.: 10 −6 ). To systematically compare models across different inference runs, we developed two statistics to assess their similarity. First, we defined a simple measurement that compares the quality of the model fits within our training dataset. More specifically, we computed the per-datum model perplexity for every subject according to: where the superscript q is used to indicate that these parameters define the model obtained from the qth trial. In practice, the integral in Equation 5 cannot be computed analytically, so we computed P(s i |θ, ψ) using a stochastic (Monte Carlo) approximation: where Z j denotes the jth of M random samples obtained from approximate posterior defined by q(Z|s, ψ). By default, we set M = 10; we experimented with larger values of M but found no obvious improvement in performance. To compare the quality of two models, we simply computed the following paired perplexity statistic: where P(θ q , ψ q ||θ r , ψ r ) > 0 indicates that qth model has systematically higher perplexity (i.e. results in a poorer fit) than the rth model. Practically speaking, there was a fair amount noise associated with this statistic, so we used bootstrapped resampling [42] to estimate a 95% confidence interval. The top-performing model was selected as the inference trial that resulted in the lowest perplexity after accounting for sampling error (i.e. the 95% CI excluded 0 after multiple testing correction); ties were broken arbitrarily. The previously described statistic allowed us to select a top performing model while also determining if multiple models were recovered that were of similar quality. However, it does not directly address the issue of multimodality, as we could theoretically recover models of similar quality but with different parameter estimates. To assess parameter similarity, we took advantage of the fact that our linear symptom risk function is simply a matrix of component weights (W), and therefore, models that converged to similar modes in the objective function should have similar risk functions. To assess risk function similarity, we developed a statistic based on the coefficient of determination (R 2 ). First, we aligned the risk matrices from the qth and rth model (denoted W q and W r ) by solving the orthogonal procrustes problem [43], which is necessary because the models have rotational invariance. Then, we computed the similarity between two symptom risk functions as follows. Let R 2 (X, Y) denote the coefficient of determination between the two vectors X and Y. The similarity between two symptom risk matrices was computed according to: where the subscript l indexes the latent components and w l is a weighting parameter defined as: The weighting parameters are necessary in order to decrease the contribution from shadow components (latent phenotypes whose component weights in the symptom risk function are essentially 0, see below for details), which can lead to an irrelevant degradation in the statistic. In practice, after identifying the top performing model based on the per-datum perplexity, we then computed the pairwise R 2 statistic across all 20 trials (see Supplementary Figure 3a-3c for examples). Models were deemed to be adequately consistent if there was a cluster of at least 3 trials (including the topperformer) with an average R 2 measurement ≥ 0.8. For some diseases (ex: Alpha-1-antitrypsin Defiency, see Supplementary Figure 3a), the algorithm was incredibly consistent and returned nearly identical models across all of the trials. For other diseases (ex: Hereditary Hemorrhagic Telangiectasia, see Supplementary  Figure 3b), there was less overall consistency but many of the inferred models were nearly identical. In a handful of cases, the algorithm clearly detected multiple, disparate modes across trials but still met our consistency criteria (ex: von Willenbrand Disease, Supplementary Figure 3c). Overall, the symptom sets for 42 of the 50 diseases (85%) included in our analyses met the consistency criteria within the UCSF dataset using our baseline algorithmic hyper-parameters For the remaining diseases, manual review revealed that many of them were aligned to a large number of symptoms that were loosely or even entirely unassociated with the disease based on clinical knowledge.
We hypothesized that the inclusion of large numbers of unrelated symptoms was detrimental to inference, as it forced the model to try to account for an unnecessarily complex correlation structure (see below and Supplementary Figure 4). Therefore, we manually curated the symptoms aligned to the 8 diseases that failed our consistency tests and re-ran our inference algorithms. This resulted in the convergence of an additional 4 diseases. However, the symptom sets for 4 diseases still failed to converge (ex: Marfan Syndrome, see Supplementary Figure 3d and 3e for fitting results using the initial and revised symptom sets, respectively). In these cases, we attempted to improve convergence by increasing the maximum learning rate (0.05) and the maximum number of training epochs (4000). On simulated data, we found that such changes to the hyperparameters often improved the consistency of the model fits by allowing training to occur over a greater number of epochs. In the case of our actual symptom datasets, these changes enabled 1 of the 4 remaining diseases to meet our convergence criteria within the UCSF dataset (Marfan Syndrome, see Supplementary  Figure 3d-f).
Note, this same procedure was used to assess model convergence in the UKBB. However, only 38 of the 50 phenotype models converged in this dataset. The discrepancy in convergence is likely multifactorial, and possible causes include the UKBB's smaller sample size, lower number of available features, and differences in demographics/acquisition.

Estimating Effective Rank
Similar to algorithmic hyper-parameters, the selection of model hyper-parameters (model parameters not optimized during learning) can also impact inference results. With respect to the model defined in Equation  1, there is actually only a single hyper-parameter 17 , which is the number of latent phenotypes included into the model (denoted L). A simple approach to selecting the number of latent phenotypes would be to set L = 1 and isolate a single latent phenotype for each set of rare disease symptoms. In fact, the Phenotype Risk Score defined in [15] exactly corresponds to a particular solution of the Non-Negative Matrix Factorization problem [45] in which the data matrix (in this case, a log-prevalence-transformed symptom matrix) is decomposed into only a single latent factor 18 , which may explain some of its success. However, many sets of symptoms are unlikely to be explained by a single latent phenotype, particularly those that cross multiple organ systems. For example, Hereditary Hemorrhagic Telangiectasia (HHT) is a rare, vascular dysplasia characterized by the development of arteriovenous malformations and telangiectasias throughout the body. Due to a propensity for these malformations to rupture, individuals with the disorder are at increased risk for a variety of related symptoms including epistaxis and GI hemorrhage. However, these same symptoms are observed in patients with, for example, portal hypertension and coagulopathy secondary to cirrhotic liver disease [46]. Therefore, attributing all of the correlation structure among the symptoms of HHT to a single underlying latent phenotype could result in noisy and/or biased inferences.
Somewhat surprisingly, the optimization algorithm outlined above is able to automatically select the appropriate number of latent phenotypes by dropping unnecessary latent components from the symptom risk function (somewhat akin to automatic relevance determination). In other words, when the lth latent phenotype does not add useful information to the model, then the elements of the lth row of latent phenotype loading matrix (W) converge to 0 . This automated elimination of unnecessary latent phenotypes is likely the product of two features of the model and inference algorithm: the KL-divergence term in the loss function and the penalty parameter in the adaptive gradient estimator [39]. In practice, however, the rows of W that map to unnecessary latent phenotypes may converge slowly to zero, and it can be challenging to determine whether a latent phenotype is truly unnecessary by simply visually examining W. Therefore, we developed a formal approach for detecting and removing unnecessary latent phenotypes by estimating an effective rank for our inferred latent phenotype models.
Briefly, let S Log-Odds denote the log-odds matrix for the observed symptom data after producing estimates of the model parameters (denotedθ) and latent phenotypesẐ: whereŴ arises directly from the optimization of the lower bound specified in Equation 4 andẐ = E q(Z|s,ψ) Z . To estimate the effective rank of the model, we first compute the singular value decomposition [47] of the symptom log-odds matrix such that: where U and V denote the left (N × K-dimensional) and right (K × K-dimensional) singular vectors respectively and D is a K × K-dimensional diagonal matrix of singular values. Importantly, the relative fraction of the variance explained by each of the K singular vectors is given by the following simple equation [47]: where d is the diagonal of the singular value matrix. We define a model's effect rank (denoted L eff ) as the number of singular vectors of the symptom log-odds matrix that account for at least ρ × 100% of the total variance: where δ(X) returns 1 is X is true and 0 otherwise. For our cryptic phenotype models, we set ρ = 10 −6 . On simulated data, we found that L eff provided an accurate estimate of the number of latent phenotypes underlying a set of observed symptoms.
With respect to the actual diseases, the number of inferred latent phenotypes L eff was found to be correlated with the number of HPO-annotated symptoms (see Supplementary Figure 4). This isn't entirely surprising, as larger sets of symptoms have a greater chance of capturing multiple pathological processes. This relationship is important to keep in mind when applying latent phenotype models to datasets with large numbers of symptoms, as it can be challenging to consistently infer models with very complex latent phenotypic structures.

Cryptic Phenotype Identification
Although they generally provide better fits to complex symptom datasets, there are downsides to inferring latent phenotype models with multiple components. With respect to our particular application, we are interested in estimating rare disease morbidity using a single, quantitative latent phenotype inferred from some observed symptom data. When multiple latent phenotypes are inferred, a single latent component must be selected as the top candidate. In some cases, this can be done by simply examining the symptom risk function: the component weights corresponding to the latent phenotype that best captures rare disease morbidity generally align with clinical knowledge of symptom frequency. To automate this process, we took advantage of the fact that the rare diseases in our target set (see SupplementaryDataFile 1.text) were aligned to specific ICD10 codes. These codes, although likely noisy, can serve as proxies for rare disease diagnoses in the population. Therefore, to assign rare diseases to the appropriate cryptic phenotype, we simply identified the latent component in the inferred model that was most predictive of rare disease diagnoses in our training dataset. In other words, the cryptic phenotype was identified as the latent component that generated symptom frequencies (i.e morbidity) that were the most consistent with observed clinical diagnoses.
We produced estimates of the latent phenotypes using the approximate posterior distribution obtained during model inference:Ẑ = E q(Z|s,ψ) Z . Following estimation, we sorted the latent phenotypes according to the magnitude of their component symptom weight vectors (i.e. ||W l || for the lth latent phenotype), and we used the top L eff latent phenotypes as classifiers to predict rare disease diagnoses in our training dataset. The performance of the latent phenotypes at this task was assessed using the average precision score a b Supplementary Figure 4: For the rare diseases included in this study, the effective rank of the top performing latent phenotype model is plotted against the total number of annotated symptoms (N =38 diseases, which includes those with cryptic phenotype models that converged in the UCSF and UKBB datasets; see main Figure 1c). The fraction of variance threshold for determining the effective rank was set to 10 −6 . The lines in the panels represent the expected value of the effective rank given the number of annotated symptoms, which was generated using an ordinary least squares model fit to the raw data. The shaded regions indicate the bootstrapped 95% confidence intervals for the expected effective rank (computed using 1,000 re-samples). implemented in sklearn [27]. Importantly, the average precision score is a summary statistic of the precisionrecall curve, which generally maintains utility even in the setting of extremely unbalanced datasets (unlike the receiver operating characteristic curve). This approach was used to identify the cryptic phenotypes for actual rare diseases with excellent results (see Main Figures 2 and 3).

Cryptic Phenotype Evaluation
As discussed in the introduction of this text, not all rare, Mendelian diseases necessarily map to a spectrum of cryptic phenotypic variation, and a variety of artifacts can potentially arise during latent phenotype inference. To mitigate such issues, we took several steps to ensure the cryptic phenotypes inferred using our approach were replicable and consistent. First, cryptic phenotypes were independently inferred in two unique datasets of very different provenance (UCSF and the UKBB). Note, the UKBB is a smaller dataset that is likely depleted for severely ill patients (due to the nature of its recruitment). Therefore, fewer models were successfully inferred in this dataset, and when inference did succeed, the models may be of lower quality. Second, the cryptic phenotypes inferred for each disease were systemically compared across diagnosed cases and controls. If a cryptic phenotype indeed captures Mendelian disease severity, then it should be systematically higher among diagnosed cases. This relationship was assessed by computing bootstrapped estimates of the mean cryptic phenotypes among diagnosed cases and controls [42] with Pvalues computed from the re-sampling statistics. Note, this is only possible for diseases with available ICD10 diagnostic codes. Such information is available for all of the diseases in the UCSF dataset (by design), but some diseases are missing diagnostic codes in the UKBB dataset (see Supplementary Data File 1). When this data is missing, then increased severity among Mendelian disease cases was only assessed in the UCSF dataset. Of the 38 diseases that converged in both datasets, the cryptic phenotypes for 31 were significantly elevated among the diagnosed cases in the UCSF dataset (after correcting for multiple testing). Now, the above analysis only indicates that independent cryptic phenotypes can be inferred in both datasets, but it does assess whether the results are consistent. To ensure replicability and consistency, we applied the cryptic phenotype models inferred in the UKBB dataset to the UCSF dataset. Note, it is not straightforward to perform the inverse analysis (replicate UCSF models in the UKBB), as the UKBB is missing many of the ICD10-CM codes that are utilized by the UCSF models. After applying the UKBB models to the UCSF data, we assessed replicability and consistency in three ways. First, we ensured that the same latent component was identified as the cryptic phenotype in both datasets (true for 24 out of 31 diseases). Next, we assessed if the cryptic phenotypes were systematically elevated among Mendelian disease cases using this new model, and we also ensured that this result replicated in the UKBB (when possible, as ICD10 diagnostic codes are not available for all of the disorders in the UKBB). This was true for 18 of the 24 remaining disorders. Finally, we computed the coefficient of determination (R 2 ) between the cryptic phenotypes inferred by the UKBB and UCSF models within the UCSF dataset 19 . There was a fair degree of variability in R 2 estimates across diseases. In the end, we selected only those cryptic phenotypes with an R 2value greater than or equal to 0.2 (correlation coefficient=0.4) for downstream analyses (see Supplementary Data File 4). This choice was arbitrary, and we suspect that this low threshold may in part explain our inability to identify common variant modifiers for some of the diseases.

The Complete Cryptic Phenotype Analysis Pipeline
A summary of our global approach to cryptic phenotype inference is provided in Supplementary Figure 5. Note, much of the complexity associated with this approach stems from the fact that two independent models were inferred in two unique datasets. Ideally, future iterations of this approach will perform joint inference across multiple datasets, which would negate much of this complexity. However, assessing convergence and replicability will always an important part of any model-based approach to phenotype-driven analyses.

Cryptic Phenotype Replication in Arbitrary Datasets
To facilitate the replication of our results in other datasets, we wrote a small software package that imputes the cryptic phenotypes for the five diseases included in our follow up analyses into arbitrary datasets using flat text files containing ICD10-CM/ICD10-UKBB diagnoses. This software package can be found on Github: https://github.com/daverblair/CrypticPhenoImpute. This package does not perform any model refitting and simply outputs the results of the models that were trained using the UCSF/UKBB datasets. It is of course possible to train new latent phenotype models on existing/independent datasets, but this requires the use of the vlpi software package: https://github.com/daverblair/vlpi. Additional details concerning the use of both software packages can be found on Github.

Fraction Unaffected by Persistent Hematuria
Supplementary Figure 8: Common variation modifies the severity of outcomes associated with Alport Syndrome (AS). (a): Polygenic load is stratified by AS genotype and displayed using a Violin plot. The shaded regions in the plot represent a kernel-density estimate of each PGS distribution, and the boxplot in the interior of the shaded region denotes the interquartile range (box), the median (point), and the overall range (whiskers). Note, there is no association among the genotypes and polygenic burden (linear regression-base two-sided T -test;N =34,800 independent subjects). (b): Proportion of smokers (top; N =34,677 independent subjects) and reported pack-years (bottom; N =24,285 independent subjects) stratified by genotype. The points represent the mean value for each genotype, and the error bars represent the 95% confidence intervals (CIs) for the mean (obtained using bootstrapped re-sampling, N =1,000). P -values summarizing the associations between genotype and smoking history were assessed using Firth-corrected logistic (likelihood-ratio χ 2 test; top) and linear (two-sided T -test) regression. (c): Urine Microalbumin (adjusted for pack-years and baseline covariates) is plotted against PGS quantile and stratified by P/LP genotype. The points and error bars represent the mean values within each quantile and their associated 95% CIs respectively. The association statistics for the marginal PGS and genotype-by-PGS interaction effects are depicted to the right (estimated using linear regression and two-sided T -tests). (d, e): Kaplan-Meier curves for ESRD (d) and Persistent Hematuria (e) onset stratified by P/LP carrier status. Association statistics for the P/LP genotype (estimated using Cox-Proportional Hazards regression, see Methods) are displayed within each plot. The shaded regions represent the 95% confidence intervals for the survival curves. The summary statistics reported in this figure were not adjusted for multiple testing.